On some document clustering algorithms for data mining

Authors

D. S. Zeimpekis

E. Gallopoulos

Abstract

We consider the problem of clustering large document sets into disjoint groups or clusters. Our starting point is recent literature on effective clustering algorithms, specifically Principal Direction Divisive Partitioning (PDDP), proposed by Boley in [1], and Spherical k-Means (“S–kmeans” for short), proposed by Dhillon and Modha in [4]. In this paper we study and evaluate the performance of these algorithms and propose specific refinements. We also explore the effectiveness of PDDP for various partitioning and termination rules. Finally, we present results that demonstrate the effectiveness of both PDDP and S–kmeans for the computation of low-rank matrix approximations.

Document clustering is heavily used in many fields, including data mining and information retrieval. The vector space model in document clustering uses an m×n matrix, e.g. a term-by-document (or term-frequency-by-document) matrix, where m is the number of terms or attributes and n is the number of documents. Clustering algorithms can be divided into “partitional” and “hierarchical”. Examples of the former are the k-means algorithm [5] and its aforementioned variation, S–kmeans. The latter, hierarchical class includes algorithms that produce clusters via a recursive “agglomerative” (bottom-up) or “divisive” (top-down) process. A recent effective divisive algorithm from this class is PDDP. k-means and its aforementioned variant are easy to implement and appear to lend themselves well to parallelization [3]. On the other hand, they are prone to converge to solutions that are only locally optimal (finding a global minimum is NP-complete) and ...
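The two algorithms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: documents are stored as rows (the transpose of the m×n term-by-document matrix), the principal direction is approximated by plain power iteration instead of an SVD routine, the PDDP split uses the sign of the projection as its only partitioning rule, and all function names (`pddp_split`, `spherical_kmeans`, and the helpers) are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def pddp_split(docs, iters=100):
    """One PDDP-style split (assumed rule: sign of the projection).

    Centers the documents, approximates the principal direction of the
    centered data by power iteration, and separates the documents by the
    sign of their projection onto that direction.
    """
    c = centroid(docs)
    centered = [[x - cx for x, cx in zip(d, c)] for d in docs]
    # Deterministic start vector; assumed not orthogonal to the principal direction.
    u = normalize([1.0 + j for j in range(len(c))])
    for _ in range(iters):
        proj = [dot(u, row) for row in centered]
        # Multiply u by the scatter matrix (sum of outer products of centered rows).
        u = normalize([sum(p * row[j] for p, row in zip(proj, centered))
                       for j in range(len(c))])
    proj = [dot(u, row) for row in centered]
    left = [i for i, p in enumerate(proj) if p <= 0]
    right = [i for i, p in enumerate(proj) if p > 0]
    return left, right

def spherical_kmeans(docs, init_ids, iters=20):
    """Spherical k-means sketch: unit-length documents, cosine similarity,
    and normalized centroids ("concept vectors") as cluster representatives."""
    unit = [normalize(d) for d in docs]
    concepts = [unit[i] for i in init_ids]  # initial concept vectors
    labels = []
    for _ in range(iters):
        # Assign each document to the concept vector with the largest cosine.
        labels = [max(range(len(concepts)), key=lambda k: dot(x, concepts[k]))
                  for x in unit]
        # Recompute each concept vector from its current members.
        for k in range(len(concepts)):
            members = [x for x, lab in zip(unit, labels) if lab == k]
            if members:
                concepts[k] = normalize(centroid(members))
    return labels

# Toy data: four documents over four terms (rows are documents).
docs = [
    [2, 1, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 1, 2, 3],
]
left, right = pddp_split(docs)
labels = spherical_kmeans(docs, init_ids=(0, 2))
print("PDDP split:", left, right)
print("S-kmeans labels:", labels)
```

In a full PDDP run the split would be applied recursively to a chosen cluster until a termination rule is met; which cluster to split and when to stop are exactly the rules the abstract says the paper explores, so they are left out of this sketch.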


Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...


A Clustering Algorithm for Categorical Data Based on Combining Measures

Clustering is one of the main techniques in data mining. It is a process that classifies a data set into groups such that the data within a cluster are closest to each other and the data in two different clusters differ the most. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Clustering and Ranking University Majors using Data Mining and AHP algorithms: The case of Iran

Abstract: Although all university majors are important and the need for them is not in question, they might not have the same priority given the different resources and strategies that a country may have. This paper focuses on clustering and ranking university majors in Iran. To do so, a model is presented to clarify the procedure. Eight different criteria are ...


A Multi-Objective Approach to Fuzzy Clustering using ITLBO Algorithm

Data clustering is one of the most important areas of research in data mining and knowledge discovery. Recent research in this area has shown that the best clustering results can be achieved using multi-objective methods. In other words, assuming more than one criterion as objective functions for clustering data can measurably increase the quality of clustering. In this study, a model with two ...


A Study of the Problems of the DBSCAN Clustering Algorithm and a Review of Proposed Improvements

Clustering is an important knowledge discovery technique in databases. Density-based clustering algorithms are one of the main methods for clustering in data mining. These algorithms have some special features, including being independent of the shape of the clusters, being highly understandable, and being easy to use. DBSCAN is a base algorithm for density-based clustering algorithms. DBSCAN is able to...




Publication date: 2003
